This project looks at Egypt's economy over the period 1961-2020 using four main economic indicators: gross domestic product (GDP), income per person, inflation rate, and income inequality (Gini coefficient). The aim is to display trends, explore relationships, and perhaps predict some future values.
All indicators were collected from Gapminder in CSV format:
gdp, inflation, income, gini
# Import needed libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
from matplotlib import dates as mpl_dates
from pandas_profiling import ProfileReport
from functools import reduce
import plotly.express as px
import chart_studio.plotly #For World Map
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
# Load data
gdp = pd.read_csv('total_gdp_us_inflation_adjusted.csv')
inflation = pd.read_csv('inflation_annual_percent.csv')
income = pd.read_csv('income_per_person_gdppercapita_ppp_inflation_adjusted.csv')
gini = pd.read_csv('gini_coefficient.csv')
#Take a look
gini.sample(5)
| | country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2041 | 2042 | 2043 | 2044 | 2045 | 2046 | 2047 | 2048 | 2049 | 2050 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 122 | Mauritania | 39.4 | 39.4 | 39.4 | 39.4 | 39.4 | 39.4 | 39.4 | 39.4 | 39.4 | ... | 32.5 | 32.5 | 32.5 | 32.5 | 32.5 | 32.5 | 32.5 | 32.5 | 32.5 | 32.5 |
| 141 | Papua New Guinea | 24.4 | 24.4 | 24.4 | 24.4 | 24.4 | 24.4 | 24.4 | 24.4 | 24.4 | ... | 42.0 | 42.0 | 42.0 | 42.0 | 42.0 | 42.0 | 42.0 | 42.0 | 42.0 | 42.0 |
| 133 | Nauru | 39.3 | 39.3 | 39.3 | 39.3 | 39.3 | 39.3 | 39.3 | 39.3 | 39.3 | ... | 34.1 | 34.1 | 34.1 | 34.1 | 34.1 | 34.1 | 34.1 | 34.1 | 34.1 | 34.1 |
| 84 | Israel | 34.7 | 34.7 | 34.7 | 34.7 | 34.7 | 34.7 | 34.7 | 34.7 | 34.7 | ... | 39.0 | 39.0 | 39.0 | 39.0 | 39.0 | 39.0 | 39.0 | 39.0 | 39.0 | 39.0 |
| 165 | Slovenia | 21.2 | 21.2 | 21.2 | 21.2 | 21.2 | 21.2 | 21.2 | 21.2 | 21.2 | ... | 25.1 | 25.1 | 25.1 | 25.1 | 25.1 | 25.1 | 25.1 | 25.1 | 25.1 | 25.1 |
5 rows × 252 columns
This is the second step of data wrangling: assessing the collected datasets for quality and tidiness issues.
First of all, I expect to have one major tidiness issue since I'm only interested in Egypt data, while these dataframes include data for different countries. So in this step, I'll look for data issues taking into consideration this significant point.
# Check GDP dataframe
gdp[gdp.country == 'Egypt']
| | country | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | ... | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 55 | Egypt | 20.3B | 21.3B | 22.2B | 24.5B | 27.3B | 28.7B | 30.1B | 30.4B | 29.9B | ... | 294B | 300B | 307B | 316B | 329B | 344B | 358B | 377B | 398B | 412B |
1 rows × 62 columns
# Check Inflation Rate dataframe
inflation[inflation.country == 'Egypt']
| | country | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | 1969 | ... | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 56 | Egypt | 1.61 | 0.396 | 0.915 | 0.862 | 5.46 | 2.75 | 2.83 | 1.8 | 0.806 | ... | 11.7 | 19.5 | 8.71 | 11.2 | 9.93 | 6.25 | 22.9 | 21.4 | 13.6 | 5.59 |
1 rows × 61 columns
# Check income dataframe
income[income.country == 'Egypt']
| | country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2041 | 2042 | 2043 | 2044 | 2045 | 2046 | 2047 | 2048 | 2049 | 2050 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 51 | Egypt | 1110 | 1110 | 1110 | 1110 | 1110 | 1110 | 1110 | 1110 | 1110 | ... | 20.9k | 21.3k | 21.8k | 22.2k | 22.7k | 23.2k | 23.7k | 24.2k | 24.7k | 25.2k |
1 rows × 252 columns
# Check gini dataframe
gini[gini.country == 'Egypt']
| | country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2041 | 2042 | 2043 | 2044 | 2045 | 2046 | 2047 | 2048 | 2049 | 2050 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 51 | Egypt | 34.4 | 34.4 | 34.4 | 34.4 | 34.4 | 34.4 | 34.4 | 34.4 | 34.4 | ... | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 |
1 rows × 252 columns
Everything looks usable so far. The plan is to take care of the main tidiness issue first, then fix the remaining quality issues, such as dropping unneeded rows and renaming variables.
This is the third step of data wrangling, where the issues found during assessment are fixed using the define-code-test approach. Before starting the cleaning process, I'll create a copy of each data table as a best practice.
# Create copies of the original dataframes to avoid data loss
clean_gdp = gdp.copy()
clean_inflation = inflation.copy()
clean_income = income.copy()
clean_gini = gini.copy()
#Check everything is successfully loaded
clean_gini.head()
| | country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2041 | 2042 | 2043 | 2044 | 2045 | 2046 | 2047 | 2048 | 2049 | 2050 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 30.5 | 30.5 | 30.5 | 30.5 | 30.5 | 30.5 | 30.5 | 30.5 | 30.5 | ... | 38.2 | 38.2 | 38.2 | 38.2 | 38.2 | 38.2 | 38.2 | 38.2 | 38.2 | 38.2 |
| 1 | Angola | 57.7 | 57.7 | 57.7 | 57.7 | 57.7 | 57.7 | 57.7 | 57.7 | 57.7 | ... | 52.0 | 52.0 | 52.0 | 52.0 | 52.0 | 52.0 | 52.0 | 52.0 | 52.0 | 52.0 |
| 2 | Albania | 39.9 | 39.9 | 39.9 | 39.9 | 39.9 | 39.9 | 39.9 | 39.9 | 39.9 | ... | 33.7 | 33.7 | 33.7 | 33.7 | 33.7 | 33.7 | 33.7 | 33.7 | 33.7 | 33.7 |
| 3 | Andorra | 42.5 | 42.5 | 42.5 | 42.5 | 42.5 | 42.5 | 42.5 | 42.5 | 42.5 | ... | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 |
| 4 | United Arab Emirates | 39.8 | 39.8 | 39.8 | 39.8 | 39.8 | 39.8 | 39.8 | 39.7 | 39.7 | ... | 25.9 | 25.9 | 25.9 | 25.9 | 25.9 | 25.9 | 25.9 | 25.9 | 25.9 | 25.9 |
5 rows × 252 columns
# Keep only the Egypt row in each dataframe (filtering the copies made above)
clean_gdp = clean_gdp[clean_gdp.country == 'Egypt']
clean_inflation = clean_inflation[clean_inflation.country == 'Egypt']
clean_income = clean_income[clean_income.country == 'Egypt']
clean_gini = clean_gini[clean_gini.country == 'Egypt']
#Test
clean_inflation
| | country | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | 1969 | ... | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 56 | Egypt | 1.61 | 0.396 | 0.915 | 0.862 | 5.46 | 2.75 | 2.83 | 1.8 | 0.806 | ... | 11.7 | 19.5 | 8.71 | 11.2 | 9.93 | 6.25 | 22.9 | 21.4 | 13.6 | 5.59 |
1 rows × 61 columns
# Transform the table structure so it makes more sense as years in rows not columns
clean_gdp = clean_gdp.melt(id_vars=["country"], var_name="year", value_name="GDP")
clean_inflation = clean_inflation.melt(id_vars=["country"], var_name="year", value_name="inflation")
clean_income = clean_income.melt(id_vars=["country"], var_name="year", value_name="income")
clean_gini = clean_gini.melt(id_vars=["country"], var_name="year", value_name="gini_coefficient")
#Test
clean_inflation.head()
| | country | year | inflation |
|---|---|---|---|
| 0 | Egypt | 1961 | 1.61 |
| 1 | Egypt | 1962 | 0.396 |
| 2 | Egypt | 1963 | 0.915 |
| 3 | Egypt | 1964 | 0.862 |
| 4 | Egypt | 1965 | 5.46 |
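The wide-to-long reshaping used above can be seen in isolation on a tiny frame. This is only a sketch: the one-row table below is built by hand to mirror the layout of the Gapminder files, using the first three Egypt inflation values shown earlier, and the names `wide` and `long` are illustrative.

```python
import pandas as pd

# One-row "wide" table shaped like the Gapminder files: one column per year.
wide = pd.DataFrame(
    {"country": ["Egypt"], "1961": [1.61], "1962": [0.396], "1963": [0.915]}
)

# melt() keeps `country` as an identifier and stacks the year columns into rows.
long = wide.melt(id_vars=["country"], var_name="year", value_name="inflation")

print(long)
```

After melting, each year is a row, which is the shape the merge and the plots further down expect. Note that `year` is still a string at this point because it came from column labels.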
#Merge the four indicator tables on year and country
df = (pd.merge(clean_gdp, clean_inflation, on=['year', 'country'])
        .merge(clean_income, on=['year', 'country'])
        .merge(clean_gini, on=['year', 'country']))
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 60 entries, 0 to 59
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   country           60 non-null     object
 1   year              60 non-null     object
 2   GDP               60 non-null     object
 3   inflation         60 non-null     object
 4   income            60 non-null     object
 5   gini_coefficient  60 non-null     float64
dtypes: float64(1), object(5)
memory usage: 3.3+ KB
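The merged frame has 60 rows because `pd.merge` performs an inner join by default, so only the years present in every table survive (the GDP file starts in 1960, the inflation file in 1961). A minimal sketch of that behaviour, using hand-built toy tables with a few values from the outputs above; the names `gdp_long` and `inflation_long` are illustrative:

```python
import pandas as pd

# Toy long-format tables; the GDP table has one extra leading year (1960),
# like the real GDP file.
gdp_long = pd.DataFrame(
    {"country": ["Egypt"] * 3, "year": ["1960", "1961", "1962"],
     "GDP": ["20.3B", "21.3B", "22.2B"]}
)
inflation_long = pd.DataFrame(
    {"country": ["Egypt"] * 2, "year": ["1961", "1962"],
     "inflation": ["1.61", "0.396"]}
)

# how="inner" is the pandas default: only key pairs found in both tables are kept.
merged = pd.merge(gdp_long, inflation_long, on=["year", "country"], how="inner")

print(merged["year"].tolist())
```

The 1960 row is silently dropped, which is what we want here. Passing `how="outer"` instead would keep it and fill the missing inflation value with NaN.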
# Delete country column
del df['country']
#Convert the indicator columns to numeric dtypes
#year
df['year'] = df['year'].astype(int)
#inflation: replace the Unicode minus sign '−' with an ASCII hyphen before converting
df['inflation'] = df['inflation'].str.replace('−', '-').astype(float)
#Manually correct two values (rows 13 and 18); .loc avoids chained assignment
df.loc[13, 'inflation'] = 9.42
df.loc[18, 'inflation'] = 23.5
#income
#Both suffixes become a factor of 10**3: 'k' turns income into dollars, 'B' turns GDP into millions of dollars
mp = {'k':' * 10**3', 'B':' * 10**3'}
df['income'] = pd.eval(df['income'].replace(mp.keys(), mp.values(),
regex=True).str.replace(r'[^\d\.\*]+','', regex=True))
df['income'] = df['income'].astype(float)
#GDP
#Add a second GDP column scaled by 10**3 (the name GDP_thousands is kept; the values are billions * 1000)
df['GDP_thousands'] = pd.eval(df['GDP'].replace(mp.keys(), mp.values(),
regex=True).str.replace(r'[^\d\.\*]+','', regex=True)).astype(float)
df['GDP'] = df['GDP'].str.replace('B', '').astype(float)
df.rename(columns={'GDP':'GDP_billions'}, inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 60 entries, 0 to 59
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   year              60 non-null     int64
 1   GDP_billions      60 non-null     float64
 2   inflation         60 non-null     float64
 3   income            60 non-null     float64
 4   gini_coefficient  60 non-null     float64
 5   GDP_thousands     60 non-null     float64
dtypes: float64(5), int64(1)
memory usage: 5.3 KB
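The cleaning cell above converts suffixed strings such as `20.9k` and `412B` by rewriting them into expressions for `pd.eval`. An alternative that may be easier to read is a small parsing helper. The sketch below is not part of the original notebook: `parse_number` and `SUFFIXES` are hypothetical names, and the `M` entry is an assumption, since only `k` and `B` appear in the tables shown here.

```python
# Hypothetical helper, not part of the original notebook.
# Multipliers for the unit suffixes; "M" is an assumption (not seen in the data above).
SUFFIXES = {"k": 1e3, "M": 1e6, "B": 1e9}


def parse_number(text):
    """Convert strings such as '1110', '20.9k', '412B' or '−0.5' to a float."""
    # Replace the Unicode minus sign (U+2212) with an ASCII hyphen.
    cleaned = str(text).strip().replace("\u2212", "-")
    if cleaned and cleaned[-1] in SUFFIXES:
        return float(cleaned[:-1]) * SUFFIXES[cleaned[-1]]
    return float(cleaned)


print(parse_number("1110"))
print(parse_number("20.9k"))
print(parse_number("412B"))
```

On a real column this could be applied with something like `df['income'].map(parse_number)`. Unlike the `mp` mapping above, this version returns plain dollars for both suffixes, so any rescaling to thousands or billions would be an explicit division afterwards.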
Looks like it's all set to begin the EDA!
In this stage, I'll investigate the cleaned dataset, aiming to find some useful insights.
Note that all of the EDA is merely tentative.
First, I'll generate a quick profile report to perform some initial explorations:
profile = ProfileReport(df, title="Summary Statistics for Economy in Egypt (1961-2020)")
profile
Interesting! It seems like our dependent variable (economic growth measured by GDP) has a positive relationship with employment, balance of trade and investment, while over the same period it has a negative relationship with inflation. This makes sense because the first three variables, in theory, have a positive effect on economic growth. The relationship with inflation, however, gives us a major hint that inflation in Egypt isn't mainly driven by demand but mostly by rising costs. To quote Prof. Colin Cavendish-Jones: "Economic growth causes higher inflation when it is driven by demand. However, if demand and productive capacity increase at the same rate, inflation will normally remain stable. When inflation is caused solely by a rise in the cost of raw materials, it will not be accompanied by economic growth". So let's dig deeper into this and observe the trends in each section, following the guiding questions.
#Super scatterplot
sb.set()
cols = ['inflation', 'income', 'GDP_thousands', 'gini_coefficient']
sb.pairplot(df[cols], height = 3.5)
plt.show();
Although we already saw these distributions in the profile report earlier, this super scatter plot gives a reasonable idea of the relationships between the variables.
One figure we may find interesting is the one between income and GDP_thousands. The dots form an almost perfectly straight line, which makes sense because income per person is derived from GDP, so the two variables are collinear. More about how to calculate income here.
The plot of income against gini_coefficient is also thought-provoking. Up to a certain point, income inequality (measured by Gini) increased as income increased. Afterwards, something (probably major political or economic events) pushed it onto a downward slope.
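One way to put a number on such a near-linear relationship is the Pearson correlation coefficient. The sketch below uses made-up figures, not Gapminder data, and the column names are illustrative. It only shows why a per-person quantity derived from a total tends to move almost in lockstep with it when population grows slowly.

```python
import pandas as pd

# Made-up numbers for illustration only (not Gapminder data).
toy = pd.DataFrame(
    {
        "gdp": [100.0, 120.0, 150.0, 190.0, 240.0],    # total output
        "population": [50.0, 52.0, 54.0, 56.0, 58.0],  # slowly growing population
    }
)

# Income per person is total GDP divided by population.
toy["income"] = toy["gdp"] / toy["population"]

# Pearson correlation between the two columns (Series.corr defaults to Pearson).
r = toy["gdp"].corr(toy["income"])
print(round(r, 4))
```

On the real data the same check would be `df['income'].corr(df['GDP_thousands'])`. A value close to 1 is a reason to avoid using both columns as separate predictors in the same regression.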
inflation_series = go.Scatter(
x = df.year,
y = df.inflation,
mode = "lines+markers",
name = "Inflation Rate, GDP deflator(%)",
marker = dict(color = 'rgba(255, 0, 0, 1)')
)
layout = dict(title = 'Inflation Rate in Egypt (1961-2020)', title_font_size=20,
xaxis= dict(title= 'Year',ticklen= 5,zeroline= False)
)
fig = dict(data = inflation_series, layout = layout)
iplot(fig)
This graph shows 3 main peaks in inflation rate, let's look at them and see if we can find any logical reasons:
Taking both the graph and the summary statistics for the inflation variable into consideration, inflation fluctuated heavily from 1961 to 2020, leading to unstable prices, and averaged 9.4%. This means that, on average, the prices of goods and services in the Egyptian economy increased by 9.4% compared with the previous year.
#Preparing world inflation
inf_nan = inflation.dropna().copy()  # explicit copy avoids SettingWithCopyWarning
inf_nan['2020'] = inf_nan['2020'].str.replace('−', '-').astype(float)
#Removing outliers (rows 211 and 165)
inf_nan = inf_nan.drop(index=[211, 165])
inf_nan['2020'].mean()
3.702220481927711
fig = px.choropleth(inf_nan, locations='country', color='2020', title='World Inflation 2020',
locationmode='country names', color_continuous_scale='Reds',
labels={'2020':'Inflation Rate'} )
fig.update_layout(title_font_size=27, title_pad=dict(l=260))
fig.show()
Now it's apparent that Egypt had relatively high inflation compared with the rest of the world in 2020: the latest data shows an inflation rate of 5.59%, which lies above the world average of 3.70%.
income_series = go.Scatter(
x = df.year,
y = df.income,
mode = "lines+markers",
name = "Income (US 2011 Dollars)",
marker = dict(color = 'rgba(0, 128, 155, 1)')
)
layout = dict(title = 'Income in Egypt (1961-2020)', title_font_size=20,
xaxis= dict(title= 'Year',ticklen= 5,zeroline= False)
)
fig = dict(data = income_series, layout = layout)
iplot(fig)
Due to the ⚠️limitations⚠️ of national income as an indicator of development, economists such as Rostow, Baran and Leibenstein favored per capita income as an index of development. However, it should be noted that per capita income has limitations of its own:
Looking at income per capita in Egypt, we can see that it has a positive slope, increasing over time from 1,910 US dollars in 1961 (the lowest) to 11,900 US dollars in 2020 (the highest), with slight deviations. But, taking the previous limitations into consideration, this doesn't necessarily mean any actual improvement in the standard of living or in Egypt's economic development.
However, we can get a better understanding of the standard of living by ➡️➡️ putting per capita income side by side with income inequality (expressed by the Gini coefficient), since the latter indicates how equally wealth is distributed in an economy.
gdp_series = go.Scatter(
x = df.year,
y = df.GDP_thousands,
mode = "lines+markers",
name = "GDP (Millions of US 2005 Dollars)",  # GDP_thousands holds billions * 10**3, i.e. millions
marker = dict(color = 'rgba(0, 153, 0, 1)')
)
layout = dict(title = 'GDP in Egypt (1961-2020)', title_font_size=20,
xaxis= dict(title= 'Year',ticklen= 5,zeroline= False)
)
fig = dict(data = gdp_series, layout = layout)
iplot(fig)
GDP equals the value of all the goods and services produced for money in an economy, evaluated at their market prices. GDP data provides an important and informative snapshot into the behaviour and performance of the overall economy. How so?
The rate of expansion of real GDP is usually interpreted as the most important measure of economic growth. A recession 📉 occurs when real GDP shrinks (usually, for at least two quarters in a row). Whereas a recovery 📈 is said to begin when real GDP starts growing again.
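To make the idea of expansion and contraction concrete, the growth rate can be computed as the percentage change from one year to the next. The sketch below uses the nine Egypt GDP values for 1960-1968 shown in the table near the top of the notebook. Because the data is annual, it can only flag years in which GDP fell, not the two-consecutive-quarters definition of a recession. The variable names are illustrative.

```python
import pandas as pd

# Egypt's GDP in billions of US dollars for 1960-1968, copied from the table above.
gdp_billions = pd.Series(
    [20.3, 21.3, 22.2, 24.5, 27.3, 28.7, 30.1, 30.4, 29.9],
    index=range(1960, 1969),
)

# Year-over-year growth in percent; the first year has no predecessor, so it is NaN.
growth_pct = gdp_billions.pct_change() * 100

# Years in which real GDP fell compared with the previous year.
contraction_years = growth_pct[growth_pct < 0].index.tolist()

print(growth_pct.round(2))
print(contraction_years)
```

Among these nine values, 1968 is the only year with a decline (30.4 to 29.9). Running the same two lines on `df['GDP_billions']` would cover the full 1961-2020 range.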
Accordingly, the graph above shows a sustained upward trend in GDP. The last recorded value is 412 billion US dollars in 2020 (412,000 in the GDP_thousands column, which is scaled in millions of dollars).
However, there's one important ⚠️limitation⚠️ to displaying economic growth: Underground Economy. The underground economy (or black market) refers to cash and barter transactions that are not formally recorded in GDP and are often used to support the trade of illegal goods and services (e.g., drugs, weapons, prostitution, etc.). The scale of underground economies varies greatly between nations, and, in some cases, they make up a substantial percentage of a country’s economic output. The underground market is almost impossible to estimate or value, and due to its illegal nature, it is rarely incorporated into a nation’s published GDP figure. Thus, some nations’ economic output may be understated by GDP.
So it's safe to say that Egypt's economy grew over the years 1961-2020, but not necessarily at the same rate shown here.
⁉️ Another significant question to ask here is: how much of this economic growth is caused by debts and loans? This factor needs to be taken into consideration, if we're looking for indication of real economic development.
inequality_series = go.Scatter(
x = df.year,
y = df.gini_coefficient,
mode = "lines+markers",
name = "Gini Coefficient",
marker = dict(color = 'rgba(153, 51, 255, 1)')
)
layout = dict(title = 'Income Inequality in Egypt (1961-2020)', title_font_size=20,
xaxis= dict(title= 'Year',ticklen= 5,zeroline= False)
)
fig = dict(data = inequality_series, layout = layout)
iplot(fig)
The Gini coefficient amounts to a kind of percentage and can run from 0 to 100. A Gini of 0 represents 0 percent concentration in a country’s income distribution. In a country with a Gini coefficient of 0, everyone receives exactly the same income.
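For intuition, the Gini coefficient can be computed from a list of individual incomes as half the mean absolute difference between all pairs, divided by the mean income. The sketch below uses that textbook formulation on the same 0-100 scale. It is not necessarily how the Gapminder series is estimated, and the function name and sample incomes are made up.

```python
def gini(incomes):
    """Gini coefficient on a 0-100 scale via the mean absolute difference."""
    n = len(incomes)
    mean = sum(incomes) / n
    # Sum of |a - b| over every ordered pair of incomes.
    total_abs_diff = sum(abs(a - b) for a in incomes for b in incomes)
    return 100 * total_abs_diff / (2 * n * n * mean)


print(gini([10, 10, 10, 10]))  # everyone receives the same income
print(gini([0, 0, 0, 40]))     # one person receives everything
```

The first call returns 0 and the second returns 75. With only four people the maximum is (n - 1) / n = 75 rather than 100; the upper bound approaches 100 as the population grows.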
The Gini coefficient in Egypt ranges from 25.4% to 32.8% over 1961-2020, which is relatively low, meaning a more equal distribution of income, right? 🤔
While low numbers represent greater income/wealth equality, low numbers aren't always a perfect indicator of economic health. Nations such as Sweden, Belgium, and Iceland all cluster in the 20s on this 0-100 scale, as do a host of former Soviet nations. In the first group, the numbers are close because residents generally have a high standard of living, while in the second the close numbers suggest a relatively equal distribution of poverty.
Still, in order to know whether Egypt is in a good place in terms of income equality, at least for now, we need to take a comparative approach and display the 2020 Gini coefficient values for all countries, including Egypt.
#World mean
gini['2020'].mean()
38.64467005076142
fig = px.choropleth(gini, locations='country', color='2020', title='World Gini Coefficient 2020',
locationmode='country names', color_continuous_scale="Purpor",
labels={'2020':'Income Inequality'} )
fig.update_layout(title_font_size=27, title_pad=dict(l=220))
fig.show()
It seems like Egypt's situation here is relatively good: its Gini coefficient is 32.1, which is below the world average of 38.6 in 2020.
data = [gdp_series, income_series]
layout = dict(title = 'GDP & Income in Egypt (1961-2020)',
xaxis= dict(title= 'Year',ticklen= 5,zeroline= False)
)
fig = dict(data = data, layout = layout)
iplot(fig)
income_decimal = go.Scatter(
x = df.year,
y = df.income.apply(lambda x: x/1000),
mode = "lines+markers",
name = "Income Per Person US Thousands",
marker = dict(color = 'rgba(0, 128, 155, 1)')
) #to show differences clearly
data = [inequality_series, income_decimal]
layout = dict(title = 'Income & Income Inequality in Egypt (1961-2020)',
xaxis= dict(title= 'Year',ticklen= 5,zeroline= False)
)
fig = dict(data = data, layout = layout)
iplot(fig)
⚫️ It's important before interpreting the graph, to note here that Income Per Capita is measured in thousands of USD, while on the other hand, Income Inequality is measured by a specific formula and ranges from 0 to 100 points.
We can notice that as income per person increases, income inequality increases too with some exceptions, like in years:
data = [inflation_series, income_decimal]
layout = dict(title = 'Income & Inflation in Egypt (1961-2020)',
xaxis= dict(title= 'Year',ticklen= 5,zeroline= False)
)
fig = dict(data = data, layout = layout)
iplot(fig)
It's equally important here, before interpreting the results, to note the difference in scale and units between income per capita (thousands of USD) and the inflation rate (annual percentage change in the GDP deflator).
We can see that, unlike income, inflation was highly volatile in Egypt over 1961-2020: it fluctuates a lot in spite of the growing income.
In addition to the limitations previously reported for each single indicator, other limitations for the whole project include:
Economics is a social science, meaning that much of the field is based on human behavior, which can be somewhat irrational and unpredictable. So numbers don't always represent facts here, or to put it more accurately, numbers alone are reductive.
The time series in this EDA are only descriptive. An in-depth time series analysis that could feed ML prediction models would require stationarity tests, such as the Dickey-Fuller test, followed by a seasonality check and other best practices.
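As a rough illustration of why stationarity matters, a series with a steady trend has a mean that keeps drifting, while its first difference (the change between consecutive observations) can be stable. The sketch below shows only first differencing on a made-up series. It is not a stationarity test; in practice one would typically reach for `adfuller` from `statsmodels.tsa.stattools` for the (augmented) Dickey-Fuller test.

```python
def first_difference(values):
    """Return the change between consecutive observations."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Made-up series with a steady upward trend (not real data).
trending = [100, 110, 120, 130, 140, 150]
diffs = first_difference(trending)

print(diffs)
```

The original series climbs from 100 to 150, but every difference equals 10, so the differenced series has a constant level. Trending indicators such as GDP or income are usually differenced (or converted to growth rates) before modelling.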
With that said, this project tackled the main economic indicators of Egypt in 1961-2020 and showed that, over this period:
1. Inflation is highly volatile in Egypt; prices of goods and services aren't stable.
2. Gross Domestic Product and Income Per Capita are on a growing trend, but they're drifting apart over the years.
3. Income equality is relatively good in Egypt, but factors like corruption and the underground economy must be taken into consideration, because they affect real income, which creates inequality.